import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline
'''Logins json'''
with open ("logins.json") as f:
logins=json.load(f)
loginsDf=pd.DataFrame.from_dict(logins)
print(loginsDf.shape)
loginsDf.head(3)
loginsDf.to_csv("logins.csv")
'''ultimate_data_challenge json'''
import json
with open ("ultimate_data_challenge.json") as f:
data=json.load(f)
df=pd.DataFrame.from_dict(data)
print(df.shape)
df.head(3)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("logins.csv", header=True,inferSchema=True)
df.columns
df.printSchema()
df.show(5)
'''Create a Spark Table View'''
df.createOrReplaceTempView("logins")
spark.sql("select * from logins").show(10)
spark.sql("select min(login_time), max(login_time) from logins").show()
aggTbl=spark.sql('''
select
timekey2,
count(timekey2) as ts_cnt,
max(login_time) as ts_max
from
(
(select login_time,
UNIX_TIMESTAMP(login_time) as timekey1,
floor(UNIX_TIMESTAMP(login_time)/(15 * 60)) as timekey2
from logins
order by login_time) as t1)
group by timekey2
order by timekey2
''')
df=pd.DataFrame(
aggTbl.collect(),
columns=aggTbl.schema.names
)
print(df.shape)
print(df.head(10))
df["time_max"]=[item.strftime("%H:%M:%S") for item in df.ts_max]
df["date_max"]=[item.strftime("%Y-%m-%d") for item in df.ts_max]
Summary: The original dataset collects records of user login from the date 2014/01/01 to 2014/07/01. The following plot shows the counts of logins per 15 mins intervals. As shown, we could oberve that there might be recognized patterns, so that we will take a closer look onto the daily login cycles below.
df.plot(x="ts_max", y="ts_cnt", figsize=(15,5))
plt.show()
df.head(10)
This plot shows the ranking of login in daily basis. As we can see, the maximum number of logins hit about 1800 times on 1970/04/04; The least number of logins is only about 100 times on 1970/01/01. The login frequency varies quite a lot during this half year period.
df1=df.groupby(["date_max"], as_index=False).sum()
df1=df1.sort_values(by="ts_cnt")
df1.plot.barh(y="ts_cnt",x="date_max", figsize=(15,25), legend=False)
plt.title("Login Count Per day")
plt.ylabel("Date")
plt.xlabel("Login Counts")
plt.xticks(fontsize=10)
plt.show()
As shown below, the login frequecy varies during the day. However, there is particular forms are identified, e.g. High peak / traffic of login happens between 10am-2pm and 10pm-2am periods in most of the days; It usually have low login traffics between 5am-10am. Most of the peak counts are approximately 20, but some of the higher counts hit 20 logins.
'''Function to plot Daily Login Count '''
def plotLoginCntFunc(df,date):
df2=df[df["date_max"]== date]
totalCnt=df2.groupby(["date_max"], as_index=False).sum()["ts_cnt"]
df2.plot(x="time_max", y="ts_cnt", figsize=(15,5), legend=False)
i=0
timeLst=[]
for item in list(df2.time_max):
if i%2==1:
timeLst.append(item)
i+=1
else:
timeLst.append("")
i+=1
plt.xticks(ticks=np.arange(len(list(df2.time_max))),labels=timeLst,rotation=70)
plt.title("Date: %s - Total Count: %d"%(date, totalCnt))
plt.xlabel("Time")
plt.ylabel("Login Count")
plt.show()
dateLst=sorted(list(set(df["date_max"])))
for i in range(0,len(dateLst)):
date=dateLst[i]
plotLoginCntFunc(df,date)
1.) What would you choose as the key measure of success of this experiment in encouraging driver partners to serve both cities, and why would you choose this metric?
Ans: I would choose the "Toll costs" as the key measure for this experiment. Since the company design the incentive which reimbursing "Toll Costs" attracts driver to serve both cities, So that I could assume that the more toll fee claim by the driver, the more the driver serving both cities.
2a.) How to set up the experiment?
Ans: The company should first record every trip of the drivers and then calculate the ratio of the driver serves both cities (i.e. Let's call it Serve_Ratio). For example, if the driver serves Gotham 5 trips/day, serves Metropolis 10 times/ day, the rato equals 5/10=0.5. In other words, the ratio would be close to 1 if a driver constantly serves both cities (i.e. NOT exclusive only to one city)
2b.)what statistical test(s) you will conduct to verify the significance of the observation?
Ans: After recording the Serve_Ratio and Toll_Cost, we are able to proceed to implement a Hypothesis test to verify if "Reimbursing Toll Cost" would encourage driver serving both cities. In other words, we would find out whether Toll_cost would correlate to the Serve_ratio. For the test, we first need to set up the hypothesis which is shown as below. Once we have the test result, we could conclude whether these two factors have straight relationshp. For example, if P-Value < 0.05, then we could reject the Null Hypothesis which Toll_Cost do NOT correlate to Serve_Raio. In other words, we will strong evidence to show that Toll_Cost correlates to Serve_Raio (The higher the Toll_Cost, the Serve_Ratio more close to 1)
*- Null: Toll_Cost do NOT correlate to Serve_Raio*
*- Alt: Toll_Cost correlates to Serve_Raio*
2c.) How you would interpret the results and provide recommendations to the city operations team along with any caveats.
Ans: There are several recommendations with caveats:
* - City would need to grant more incentives to drivers for passing toll-bridge (e.g. No need to query in line to pass)*
* - Cities might be cautious to plan heavier traffic crossing bridge with more drivers serve both cities *
The following analysis is based on dataset given by a company which is called "Ultimate". The features of the dataset includes activities of anonymous drivers as shown below. The purpose of this report is to discover any specific pattern from this dataset and to build a predictive model to forecast whether a driver would be still active using "Ultimate" service in their 6th month in the system. For the machine learning purpose, target labels are then formatted as below:
<font, color="blue">label = 1 </font> if last trip date > 2014-05-31 OR <font, color="blue">label = 0 </font> if last trip date <= 2014-05-31
import pandas as pd
import numpy as np
import json
import pickle
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
'''ultimate_data_challenge json'''
import json
with open ("ultimate_data_challenge.json") as f:
data=json.load(f)
df=pd.DataFrame.from_dict(data)
print(df.shape)
pd.set_option('display.max_colwidth', -1)
'''Convet columns to datetime'''
df["last_trip_date"] = pd.to_datetime(df.last_trip_date)
df["signup_date"] = pd.to_datetime(df.signup_date)
'''
Add Target label Column
last_trip_date > 2014/05/31 will be assigned as 1, otherwise 0
'''
df["label"]=np.zeros(df.shape[0])
df["label"][df["last_trip_date"] > pd.to_datetime("2014-05-31")] =1
df["label"]=list(map(int,df["label"]))
print("Original data Size: %d"%(df.shape[0]))
df.dropna(axis=0, inplace=True)
print("After Dropping missing value Size:", df.shape[0])
df.head(5)
set(df.signup_date)
df_dummy = pd.get_dummies(df, columns=["city", "phone", "ultimate_black_user"])
print(df_dummy.shape)
df_dummy.head(3)
The following plot which shows the proportion of Target labels Active drivers are about 20% more than the Non-Active driver. In this case, I would not consider this is a biased dataset.
label_count=Counter(df.label)
labelDf=pd.DataFrame.from_dict(dict(label_count), orient="index").reset_index()
explode = (0, 0.1) # only "explode" the 2nd slice (i.e. 'Hogs')
plt.figure(figsize=(5,5))
plt.pie(labelDf[0], explode=explode,
labels=["Not Active Driver","Active Driver"],
autopct="%1.1f%%",
shadow=True, startangle=90)
plt.title("Target Label Proportion", fontsize=15)
#plt.legend()
plt.show()
import seaborn as sns
fig=plt.figure(figsize=(15,15))
cols=["avg_dist","avg_rating_by_driver",
'avg_rating_of_driver', 'avg_surge', "surge_pct",
"trips_in_first_30_days","weekday_pct"]
for i in range(len(cols)):
ax1=fig.add_subplot(3,3,i+1)
df.boxplot(column=cols[i],
by="label",
ax=ax1,
showfliers=False ) # outliers are NOT shown
plt.xlabel("(0=Not Active, 1=Active)")
fig.suptitle("Boxplot for Continuous Variable vs Label", fontsize=14)
plt.show()
Feature City: Winterfell has the most users signed up "Ultimate" service in both group of driver. The group non-active driver is approximately double Winterfell and Astapor except King's Landing. Although the total count of driver is the least in King's Landing, the active driver is in second place among the three cities. Based on this result, the company could look into this metric to discover more info.
Feature Phone: The overall count of iPhone users is approximately double than Android users in both driver groups. The large portion of Android users are also non-activer users. The proportion of iPhone users evenly diversed into 2 groups.
Feature Ultimate-Black-user: The plot shows that total count of "Black" users is almost one half of the non-Black users. However, the active users in both group is the same. In this case, I would believe this relate to incentive to attract users to sign up the "Ultimate-Black" service.
fig= plt.figure(figsize=(15,20))
cols=["city", "phone","ultimate_black_user"]
for i in range(len(cols)):
ax=fig.add_subplot(3,1,i+1)
ct=pd.crosstab(df[cols[i]], df.label)
ct.plot.bar(ax=ax)
plt.legend(["Not Active = 0", "Active = 1"])
plt.xlabel("City", fontsize=15)
plt.ylabel("Member Counts", fontsize=15)
plt.title("%s vs member activity" %(cols[i]), fontsize=20)
plt.xticks(rotation=0)
plt.show()
Summary: The correlation table below indicates the how the features correlates with each other. Firstly, let's take a look of how the predictor features react with Target label. The corrlelation values overall is not high versus the label, the highest value is about 0.24 regardless the sign of correlations. In this case, the most noticeable correlated features with label are "avg_rating_by_driver", "avg_surge", "surge_pct" and "ultimate_black_user" ranging 0.22-0.24 in terms of correlation values. This result illustrates the similar pattern comparing with the above plots. In addition, we could also observe the features "avg_surge" and "surge_pct" wahich are highly correlated with each other have 0.99 correlation value, this result shows that our findings above alignes with the computed correlation values. We should pay more attention to the result of machine learning part to see how the contribution of these features to the models.
import seaborn as sns
plt.figure(figsize=(7,7))
ax=sns.heatmap(df.corr(method="spearman"), annot=True)
plt.title("Correlation Table")
#plt.xticks(rotation=45)
plt.show()
'''
Calculate the days difference from signup day to last trip day
i.e. signup_last_date_diff = signup_date - last_trip_date
'''
df_dummy["signup_last_date_diff"] = df_dummy["last_trip_date"]- df_dummy["signup_date"]
df_dummy["signup_last_date_diff"]=df_dummy["signup_last_date_diff"].dt.days
df_dummy.head(5)
cols=["avg_dist","avg_rating_by_driver", "avg_rating_of_driver", "avg_surge","surge_pct",
"trips_in_first_30_days", "weekday_pct","signup_last_date_diff"]
# , "last_trip_month", "signup_month"]
for col in cols:
df_dummy[col] = (df_dummy[col] -df_dummy[col].mean() )/ df_dummy[col].std()
df_dummy.head(5)
df_dummy.drop(columns = ["last_trip_date", "signup_date", "signup_last_date_diff"], inplace=True)
df_dummy.head(5)
import pickle
with open("./data/XnormalizedDataPickle", "wb") as file:
pickle.dump(np.array(df_dummy.drop(columns=["label"])),file)
with open("./data/ynormalizedDataPickle", "wb") as file:
pickle.dump(np.array(df_dummy["label"]),file)
with open("./data/XnormalizedDataPickle", "rb") as file:
X = pickle.load(file)
with open("./data/ynormalizedDataPickle", "rb") as file:
y = pickle.load(file)
print(X.shape, y.shape)
#print(X.head(3), y.head(3))
Summary: Let's recall the goal of this project which is to forecast whether the driver is still active in their 6th month after signing up the Ultimate service. In this case, we should focus finding out drivers would like to stay suing Ultimate servce which means we need to find out drivers with label = 1. For this reason, I would propose to use overall Accuracy and Recall (i.e. TPR) as the primary metric to measure the performance of classifiers.
The following shows the performance of models (i.e. <font, color="blue"> Logistic Regression, Gradient Boosting Classifier and Random Forest Classifier</font>). The plots show the overall Accuracy as well as Recall metrics for the Classifiers. The models were built by using 10 Cross-Validation* method, the performance graphs show how the metrics varies for each learning iteration. As observed, all three classifers perform differently by training the given dataset, the average Accuracy is around <font, color="blue"> 69-78%</font>; the average Recall is also around <font, color="blue"> 70-75%</font>.
Overall Average Accuracy of Train and Test set aligns in range 70-72%. Average AUC value is around 75%. There is NO overfitting or under-fitting found from this model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
clf= LogisticRegression(random_state=1)
cv=10
scoring = {'accuracy': make_scorer(accuracy_score)}
#{'auc': 'roc_auc'}
grid = {
"penalty": ["l1","l2"],
"C": [0.1,1,10]
}
gsearch = GridSearchCV(estimator=clf,
param_grid= grid,
cv=cv,
scoring=scoring,
refit="accuracy",
return_train_score=True)
gsearch.fit(X, y)
'''
Plot GridSearch Accuracy and Recall
'''
def plot_logistic_gridsearch(train_or_test, metric, cv):
lst=[]
for i in range(0,cv):
if i ==0: lst= gsearch.cv_results_["split%s_%s_%s"%(str(i), train_or_test, metric)]
else: lst=np.vstack((lst, gsearch.cv_results_["split%s_%s_%s"%(str(i), train_or_test, metric)]))
colNames=[]
for i in range(len(gsearch.cv_results_["params"])):
p=gsearch.cv_results_["params"][i]['penalty']
c=gsearch.cv_results_["params"][i]['C']
string = "C:%s , penalty:%s" %(str(c),str(p))
colNames.append(string)
metricDf=pd.DataFrame(lst, columns= colNames)
#fig=plt.figure(figsize=(15,8))
metricDf.plot(figsize=(15,5))
plt.xticks(ticks=np.arange(0,10), labels=np.linspace(1,11,11, dtype=int))
plt.xlabel("CV")
plt.ylabel("%s"%(metric))
plt.title("%s Set %s" %(train_or_test, metric))
plt.legend()
plt.show()
plot_logistic_gridsearch("train", "accuracy", cv)
plot_logistic_gridsearch("test", "accuracy", cv)
print("Best Parameters:%s with score %s%% based on Accuracy"%(gsearch.best_params_, str(round(gsearch.best_score_*100,2))))
plot_logistic_gridsearch("train", "auc", cv)
plot_logistic_gridsearch("test", "auc", cv)
print("Best Parameters:%s with score %s%% based on AUC"%(gsearch.best_params_, str(round(gsearch.best_score_*100,2))))
Overall Average Accuracy of Train is above 98%, but Accuracy of Test set is only about 75%. Average Recall of Train set is above 98%, but Recall is only about 70%. In this case, Over-fitting happens in this model.
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
'''
Train Classifier by using Cross Validation Method
'''
scoring=["accuracy", "recall", "precision"] #Measuring Metrics
cv=10 #Number of Cross Validation Partitions
#Store Train and Test Score
train_accuracy_lst, train_recall_lst, train_precision_lst=[],[],[]
test_accuracy_lst,test_recall_lst, test_precision_lst=[],[],[]
#List to store "Number of Estimator"
n_estimator_lst = [100,200,500,700,1000]
for n_estimator in n_estimator_lst:
clf= RandomForestClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
scores=cross_validate(clf, X,y, cv=cv,scoring=scoring, return_train_score=True)
train_precision_lst.append(scores["train_precision"])
train_recall_lst.append(scores["train_recall"])
train_accuracy_lst.append(scores["train_accuracy"])
test_precision_lst.append(scores["test_precision"])
test_recall_lst.append(scores["test_recall"])
test_accuracy_lst.append(scores["test_accuracy"])
'''
Store Measurment metrics into DataFrame
'''
metricDf=pd.DataFrame(list(zip(n_estimator_lst,
train_accuracy_lst,train_recall_lst, train_precision_lst,
test_accuracy_lst,test_recall_lst, test_precision_lst)),
columns=["n_estimator_lst",
"train_accuracy_lst","train_recall_lst", "train_precision_lst",
"test_accuracy_lst","test_recall_lst", "test_precision_lst"])
'''
Plot Measurment metrics from the above trained models.
'''
fig=plt.figure(figsize=(15,15))
#Metric 1
ax1=fig.add_subplot(3,1,1)
for i in range(0,metricDf["train_accuracy_lst"].shape[0]):
ax1.plot(metricDf["train_accuracy_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
ax1.plot(metricDf["test_accuracy_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Accuracy vs n_estimators ")
plt.ylabel("Accuracy")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
#Metric 2
ax1=fig.add_subplot(3,1,2)
for i in range(0,metricDf["train_recall_lst"].shape[0]):
ax1.plot(metricDf["train_recall_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
ax1.plot(metricDf["test_recall_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Recall vs n_estimators ")
plt.ylabel("Recall")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
#Metric 3
ax1=fig.add_subplot(3,1,3)
for i in range(0,metricDf["train_precision_lst"].shape[0]):
ax1.plot(metricDf["train_precision_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
ax1.plot(metricDf["test_precision_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Precision vs n_estimators ")
plt.ylabel("precision")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
plt.show()
def Barplot_avg_metric(x, y, metric):
fig = plt.figure(figsize=(15,5))
# ax=fig.add_subplot(1,3,axNo)
plt.bar(x=x, height=y, width=50)
plt.ylim(top=max(y)+0.02, bottom=min(y)-0.02)
plt.xticks(ticks=np.arange(0,1100,100))
plt.title("%s vs n_estimator" %(metric))
plt.ylabel("%s" %(metric))
plt.xlabel("n_estimators")
plt.show()
avg_test_recall_lst, avg_test_accuracy_lst, avg_test_precision_lst=[], [], []
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
avg_test_accuracy_lst.append(np.mean(metricDf["test_accuracy_lst"][i]))
for i in range(0,metricDf["test_recall_lst"].shape[0]):
avg_test_recall_lst.append(np.mean(metricDf["test_recall_lst"][i]))
for i in range(0,metricDf["test_precision_lst"].shape[0]):
avg_test_precision_lst.append(np.mean(metricDf["test_precision_lst"][i]))
Barplot_avg_metric(n_estimator_lst,avg_test_accuracy_lst, "Accuracy")
Barplot_avg_metric(n_estimator_lst,avg_test_recall_lst, "Recall")
Barplot_avg_metric(n_estimator_lst,avg_test_precision_lst, "Precision")
Overall Average Accuracy of Train and Test set align 78-80%; Average Recall of Train and Test set aligns around 70-73%. There is No over-fitting or under-fitting found from this result
from sklearn.model_selection import cross_validate
from sklearn.ensemble import GradientBoostingClassifier
'''
Train Classifier by using Cross Validation Method
'''
scoring=["accuracy", "recall", "precision"] #Measuring Metrics
cv=10 #Number of Cross Validation Partitions
#Store Train and Test Score
train_accuracy_lst, train_recall_lst, train_precision_lst=[],[],[]
test_accuracy_lst,test_recall_lst, test_precision_lst=[],[],[]
#List to store "Number of Estimator"
n_estimator_lst = [100,200,500,700,1000]
for n_estimator in n_estimator_lst:
clf= GradientBoostingClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
scores=cross_validate(clf, X,y, cv=cv,scoring=scoring, return_train_score=True)
train_precision_lst.append(scores["train_precision"])
train_recall_lst.append(scores["train_recall"])
train_accuracy_lst.append(scores["train_accuracy"])
test_precision_lst.append(scores["test_precision"])
test_recall_lst.append(scores["test_recall"])
test_accuracy_lst.append(scores["test_accuracy"])
'''
Store Measurment metrics into DataFrame
'''
metricDf=pd.DataFrame(list(zip(n_estimator_lst,
train_accuracy_lst,train_recall_lst, train_precision_lst,
test_accuracy_lst,test_recall_lst, test_precision_lst)),
columns=["n_estimator_lst",
"train_accuracy_lst","train_recall_lst", "train_precision_lst",
"test_accuracy_lst","test_recall_lst", "test_precision_lst"])
'''
Plot Measurment metrics from the above trained models.
'''
fig=plt.figure(figsize=(15,15))
#Metric 1
ax1=fig.add_subplot(3,1,1)
for i in range(0,metricDf["train_accuracy_lst"].shape[0]):
ax1.plot(metricDf["train_accuracy_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
ax1.plot(metricDf["test_accuracy_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Accuracy vs n_estimators ")
plt.ylabel("Accuracy")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
#Metric 2
ax1=fig.add_subplot(3,1,2)
for i in range(0,metricDf["train_recall_lst"].shape[0]):
ax1.plot(metricDf["train_recall_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_recall_lst"].shape[0]):
ax1.plot(metricDf["test_recall_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Recall vs n_estimators ")
plt.ylabel("Recall")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
#Metric 3
ax1=fig.add_subplot(3,1,3)
for i in range(0,metricDf["train_precision_lst"].shape[0]):
ax1.plot(metricDf["train_precision_lst"][i], label="train "+str(n_estimator_lst[i]), linestyle="--")
for i in range(0,metricDf["test_precision_lst"].shape[0]):
ax1.plot(metricDf["test_precision_lst"][i], label="test "+str(n_estimator_lst[i]))
plt.xticks(ticks=np.arange(0,cv), labels=list(map(str,np.arange(1,cv+1))))
plt.title("Precision vs n_estimators ")
plt.ylabel("precision")
plt.xlabel("CV")
plt.legend(title="n_estimators", loc="right")
plt.show()
def Barplot_avg_metric(x, y, metric):
fig = plt.figure(figsize=(15,5))
# ax=fig.add_subplot(1,3,axNo)
plt.bar(x=x, height=y, width=50)
plt.ylim(top=max(y)+0.02, bottom=min(y)-0.02)
plt.xticks(ticks=np.arange(0,1100,100))
plt.title("%s vs n_estimator" %(metric))
plt.ylabel("%s" %(metric))
plt.xlabel("n_estimators")
plt.show()
avg_test_recall_lst, avg_test_accuracy_lst, avg_test_precision_lst=[], [], []
for i in range(0,metricDf["test_accuracy_lst"].shape[0]):
avg_test_accuracy_lst.append(np.mean(metricDf["test_accuracy_lst"][i]))
for i in range(0,metricDf["test_recall_lst"].shape[0]):
avg_test_recall_lst.append(np.mean(metricDf["test_recall_lst"][i]))
for i in range(0,metricDf["test_precision_lst"].shape[0]):
avg_test_precision_lst.append(np.mean(metricDf["test_precision_lst"][i]))
Barplot_avg_metric(n_estimator_lst,avg_test_accuracy_lst, "Accuracy")
Barplot_avg_metric(n_estimator_lst,avg_test_recall_lst, "Recall")
Barplot_avg_metric(n_estimator_lst,avg_test_precision_lst, "Precision")
Summary: In order to pick out the best classifier, t-tests were performed to verify the classification results (Discuss below). In addition, Receiver Operating Characteristics graph will be constructed to verify the TPR values. For the purpose searching a reliable and stable (<font, color="blue">i.e. No Over-fitting or Under-fitting</font>) model, G.B. and L.R. will be selected for further verify by t-test.
The t-test result indicates that performace of all three classifiers are distinctively different based on the Hypothesis as shown below. The P-value is smaller than t-value for all cases which means that the Null hypothesis shall be rejected, the alt. hypothesis shall be taken. With this result, the chart below shows that the Gradient Boosting generates the highest Accuacy and Recall values, so that we could conclude that G.B. model has the highest generalized capability (i.e. Accuracy=78%, Recall=69%) to forecast the driver activity. Furthermore, The ROC-AUC graph also proves that the the GB model generates average 85% AUC (i.e. Area under Curve) values with 10 Cross Validation. So that this result would be able to support our goal of this project to forecast the driver activity (i.e. label=1, Active users in the last month)
'''t-test to compare model Accuracy'''
from scipy.stats import ttest_ind
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
import time
#scoring=["accuracy", "recall", "precision"]
sampling_times=50
n_estimator=1000
cv=5
'''Random Sampling Accuracy for Model 1'''
clf= GradientBoostingClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
acc_gb_lst, recall_gb_lst=[],[]
for i in range(0,sampling_times):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
acc_gb_lst.append(accuracy_score(y_test, y_pred))
recall_gb_lst.append(recall_score(y_test, y_pred))
'''Random Sampling Accuracy for Model 2'''
clf= RandomForestClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
acc_rf_lst, recall_rf_lst=[],[]
for i in range(0,sampling_times):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
acc_rf_lst.append(accuracy_score(y_test, y_pred))
recall_rf_lst.append(recall_score(y_test, y_pred))
'''Random Sampling Accuracy for Model 3'''
clf= LogisticRegression(C=0.1,penalty="l1",random_state=1)
acc_lr_lst, recall_lr_lst=[],[]
for i in range(0,sampling_times):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
acc_lr_lst.append(accuracy_score(y_test, y_pred))
recall_lr_lst.append(recall_score(y_test, y_pred))
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(acc_gb_lst, acc_lr_lst, equal_var=False)
print("Model Gradient Boosting Average Accuracy:", np.mean(acc_gb_lst))
print("Model Logistic Regression Average Accuracy:", np.mean(acc_lr_lst))
print("ttest_ind: t = %g p = %g" % (t, p))
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(recall_gb_lst, recall_lr_lst, equal_var=False)
print("Model Gradient Boosting Average Recall:", np.mean(recall_gb_lst))
print("Model Logistic Regression Average Recall:", np.mean(recall_lr_lst))
print("ttest_ind: t = %g p = %g" % (t, p))
a1=np.array(list(zip((np.mean(recall_gb_lst), np.mean( recall_lr_lst)))))
a2=np.array(list(zip((np.mean(acc_gb_lst), np.mean( acc_lr_lst)))))
print("Result Table" )
ttestDf=pd.DataFrame(np.hstack((a1,a2)), columns=["Avg. Recall", "Avg. Average"], index=["gradient boost", "logistic regression"])
ttestDf.plot.bar(figsize=(5,7))
plt.ylim(top=max(acc_gb_lst)+0.03, bottom=min(recall_lr_lst))
plt.title("Average Recall and Average from t-test")
plt.xticks(rotation=45)
plt.axhline(y=np.mean(acc_gb_lst), linestyle="--", xmax=0.3)
plt.axhline(y=np.mean(acc_lr_lst), linestyle="--", xmax=0.8)
plt.show()
'''
Reload the Dataset (X and y)
'''
with open("./data/XnormalizedDataPickle", "rb") as file:
X = pickle.load(file)
with open("./data/ynormalizedDataPickle", "rb") as file:
y = pickle.load(file)
#print(X.shape, y.shape)
#print(X.head(3), y.head(3))
'''
Compute ROC and AUC based on the best model determined from above
'''
import numpy as np
from scipy import interp
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
# #############################################################################
cv = StratifiedKFold(n_splits=10)
#classifier = svm.SVC(kernel='linear', probability=True, random_state=random_state)
clf= GradientBoostingClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
#tprs = []
#aucs = []
#mean_fpr = np.linspace(0, 1, 100)
fig=plt.figure(figsize=(10,10))
for i,(train, test) in enumerate(cv.split(X, y)):
#print(train)
probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
# Compute ROC curve (TPR, FPR based on thresholds)
fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
# Compute area the curve
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, alpha=0.3,label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red',label='Thresholds (Chance Line)', alpha=.8)
plt.title("Receiver Operating Characteristic (ROC) \n Gradient Boosting Classifer (n=200)")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()
The following shows the proportion of importances of features for our best model (i.e. Gradient Boosting Classifier). The top four features (i.e. weekday_pct, avg_dist, surge_pct, avg_rating_by_driver) takes more than 50% of contribution to the model. Comparing this result with the boxplots above, we could have more confidence to tell how the two groups of labels being classified by these features.
There are some recommendations which might be help for longterm rider retention:
weekday_pct: The result shows that this feature is one of the important features to the model. So that Ultimate should take a closer look to drivers' habit to drive in weekdays. The analysis above shows that the more the driver drives in weekdays, the less chance the driver would stay with Ultimate.
avg_dist: Since this feature is one of main feature contributes to the model, so that Ultimate should pay more attention to the distance the drivers have done after the first month signed up. Based on the boxplot, the smaller range of distance the drivers drove, the more chance they would retain to use Ultimate service
surge_pct: The active driver tends to be use Ultimate with higher surge percent. In other words, Ultmate could offer more incentives to drivers if they intend to work in higher surge demand period.
avg_rating_by_driver: The active target group 1 driver has the wider range of rating (i.e. 4-5) by driver, Ultimate could take this as a reference to communicate more with these driver in order to retain these drivers.
clf= GradientBoostingClassifier(n_estimators= n_estimator, verbose=0, random_state=1)
clf.fit(X_train, y_train)
importanceDf=pd.DataFrame(list(zip(cols, importances)), columns=["features", "importances"]).sort_values(by=["importances"],
ascending=True)
importanceDf.plot.barh(x="features", y="importances", figsize=(7,7))
plt.xlabel("Importance Proportion")
plt.title("Model Importance Ranking (Gradient Boosting Classifier)")
plt.legend([])
plt.show()